Introduction to Open Data Science - Course Project

About the project


# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Wed Dec 01 14:46:51 2021"

My view


Myself

My name is Dongming Zhang, and I am a doctoral student at the University of Helsinki. My project focuses on the genetics of fungi such as Aspergillus niger and Trichoderma reesei.

Study goals

To be honest, I have some experience with R, Python and Matlab. However, I am not a statistics student by training, and my programming ability is limited to running scripts and writing simple functions; there were many programming questions and underlying problems that I did not understand clearly. My supervisor, from the Department of Microbiology, recommended this amazing course to me. I would therefore like to learn R in more detail in this course, so that in my project I can proficiently use R to analyze my bioinformatic data.

Regression and model validation (week 2)


date()
## [1] "Wed Dec 01 14:46:51 2021"

Read data

students2014 <- read.table("./data/learning2014.txt", sep=",", header=TRUE, stringsAsFactors = T)

str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166   7

This data comes from a survey on teaching and learning collected in 2014. It includes background information on the participants, such as ‘gender’, ‘age’ and ‘attitude’. In the dataset, ‘deep’ is a student’s mean score on the 12 deep-learning questions, ‘stra’ the mean score on the 12 strategic-learning questions, and ‘surf’ the mean score on the 12 surface-learning questions. ‘points’ is the student’s exam score. The data frame has 166 observations of 7 variables.
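The combination variables (‘deep’, ‘stra’, ‘surf’) were formed as means over groups of related questions. A minimal sketch of that kind of aggregation, using made-up item columns (the names q1–q3 are hypothetical, not from the actual survey):

```r
# Sketch: averaging related survey items into one scale variable.
# The columns q1..q3 are hypothetical stand-ins for the real items.
answers <- data.frame(q1 = c(4, 2, 5), q2 = c(3, 3, 4), q3 = c(5, 2, 4))

# Each row's mean across the items becomes the scale score,
# analogous to how 'deep', 'stra' and 'surf' were computed.
answers$deep <- rowMeans(answers[, c("q1", "q2", "q3")])
answers$deep
```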

A graphical overview

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
pairs(students2014[2:7], col = students2014$gender)

ggpairs(students2014,  lower = list(combo = wrap("facethist", bins = 20)))

The two figures above show the relationships among the 7 variables. In the ggpairs plot, significant correlations are tagged with asterisks ‘*’. All variables are roughly normally distributed except ‘age’: most survey participants were aged between 20 and 30. ‘surf’ has significant correlations with ‘attitude’, ‘deep’ and ‘stra’; of these, ‘deep’ has the largest correlation with ‘surf’ in absolute value (0.324). With regard to ‘points’, ‘attitude’ has the highest absolute correlation (0.437) and thus the most significant relationship with ‘points’.

A regression model

my_model <- lm(points ~ attitude + stra + surf, data = students2014)

summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

In this part, ‘attitude’, ‘stra’ and ‘surf’ are used as explanatory variables because they have the three highest absolute correlations with the target variable ‘points’. From the model summary, only ‘attitude’ has a statistically significant relationship with ‘points’; ‘stra’ and ‘surf’ do not. The intercept is also significant in the model. Meanwhile, the multiple R-squared is relatively low (0.2074), meaning that the explanatory variables account for only about 21% of the variation in ‘points’.
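Coefficient significance can also be read from confidence intervals: a coefficient is significant at the 5% level when its 95% interval excludes zero. A small sketch on the built-in mtcars data (not the course data):

```r
# Sketch: inspecting coefficient significance via confidence intervals.
# Uses R's built-in mtcars data, not the students2014 data.
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)     # intercept and slope estimates
confint(fit)  # 95% CIs; the 'wt' interval excludes zero, so it is significant
```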

Optimization of model

optimized_model <- lm(points ~ attitude, data = students2014)

summary(optimized_model)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Using only the most significant variable, ‘attitude’, as the explanatory variable in ‘optimized_model’, the residuals change little compared to ‘my_model’, because ‘attitude’ is the most influential factor. ‘attitude’ still has a highly significant relationship with ‘points’. The multiple R-squared decreases from 0.2074 in ‘my_model’ to 0.1906 in ‘optimized_model’, so the model still explains only a small share of the variation in the response; as a rough rule of thumb, a well-fitting model often has a multiple R-squared above 0.7.
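The multiple R-squared reported by summary() is one minus the residual sum of squares over the total sum of squares. A quick sketch of that computation (again on mtcars, not the course data):

```r
# Sketch: multiple R-squared computed by hand and checked against summary().
fit <- lm(mpg ~ wt, data = mtcars)

ss_res <- sum(residuals(fit)^2)                   # residual sum of squares
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
r2 <- 1 - ss_res / ss_tot

all.equal(r2, summary(fit)$r.squared)  # TRUE
```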

Model validation

par(mfrow = c(2,2))
plot(optimized_model, which = c(1,2,5))

‘optimized_model’ is used in the model validation step to check the model assumptions. The basic assumptions about the errors are that they are normally distributed, have constant variance, and are uncorrelated; in addition, the size of a given error should not depend on the explanatory variables. According to the top-left plot (Residuals vs Fitted), the constant-variance assumption is reasonable: the points are scattered randomly, suggesting the errors are unrelated to the explanatory variables. In the Normal Q-Q plot, the points lie close to the line, a reasonable fit, so the errors can be considered approximately normally distributed. In the Residuals vs Leverage plot, the leverage values are modest (roughly 0.02 to 0.04 on the x-axis), and no single observation has an outsized influence on the model. Overall, the model assumptions are validated reasonably well by these checks. However, this model is not good for predicting new data, because the multiple R-squared is low and few explanatory variables are related to the outcome; the model should be fitted with more relevant data for further analysis.
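The Normal Q-Q plot can be complemented with a numerical normality check; a sketch using the Shapiro-Wilk test on a model’s residuals (on mtcars here, since the course data file is not bundled):

```r
# Sketch: Shapiro-Wilk test on model residuals as a numerical
# complement to the Normal Q-Q plot; a large p-value is consistent
# with normally distributed errors.
fit <- lm(mpg ~ wt, data = mtcars)
sw <- shapiro.test(residuals(fit))
sw$p.value
```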


Logistic regression (week 3)

date()
## [1] "Wed Dec 01 14:47:02 2021"
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(ggpubr)
library(tidyr)
library(GGally)
library(boot)

Read data

alc <- read.table("./data/alc.txt", sep = ",", header = TRUE, stringsAsFactors = T)

str(alc)
## 'data.frame':    370 obs. of  50 variables:
##  $ school       : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex          : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age          : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address      : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize      : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus      : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu         : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu         : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob         : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob         : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason       : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian     : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime   : int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime    : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures.math: int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup    : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup       : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid.math    : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities   : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery      : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher       : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet     : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic     : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel       : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime     : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout        : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc         : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc         : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health       : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences.math: int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1.math      : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2.math      : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3.math      : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ id.math      : int  2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 ...
##  $ failures.por : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ paid.por     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ absences.por : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1.por       : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2.por       : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3.por       : int  11 11 12 14 13 13 13 13 17 13 ...
##  $ id.por       : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $ failures     : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ paid         : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ absences     : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1           : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2           : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3           : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use      : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use     : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ cid          : int  3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...
dim(alc)
## [1] 370  50
print(colnames(alc))
##  [1] "school"        "sex"           "age"           "address"      
##  [5] "famsize"       "Pstatus"       "Medu"          "Fedu"         
##  [9] "Mjob"          "Fjob"          "reason"        "guardian"     
## [13] "traveltime"    "studytime"     "failures.math" "schoolsup"    
## [17] "famsup"        "paid.math"     "activities"    "nursery"      
## [21] "higher"        "internet"      "romantic"      "famrel"       
## [25] "freetime"      "goout"         "Dalc"          "Walc"         
## [29] "health"        "absences.math" "G1.math"       "G2.math"      
## [33] "G3.math"       "id.math"       "failures.por"  "paid.por"     
## [37] "absences.por"  "G1.por"        "G2.por"        "G3.por"       
## [41] "id.por"        "failures"      "paid"          "absences"     
## [45] "G1"            "G2"            "G3"            "alc_use"      
## [49] "high_use"      "cid"

This dataset joins the Math course and Portuguese language course data. To examine alcohol consumption against the other variables, alc_use is defined as the average of workday and weekend alcohol consumption, and high_use is TRUE when alc_use is greater than 2, FALSE otherwise. Among the abbreviations: famrel is the quality of family relationships, freetime is free time after school, and goout is going out with friends; their values range from 1 (very low) to 5 (very high). famsup is a yes/no variable indicating whether the student receives family educational support.
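The derived variables follow directly from the definition above; a minimal sketch with made-up workday (Dalc) and weekend (Walc) values:

```r
# Sketch: deriving alc_use and high_use from workday (Dalc) and
# weekend (Walc) alcohol use, as described above. Values are made up.
toy <- data.frame(Dalc = c(1, 2, 4), Walc = c(1, 3, 5))

toy$alc_use  <- (toy$Dalc + toy$Walc) / 2  # average consumption
toy$high_use <- toy$alc_use > 2            # TRUE when the average exceeds 2

toy
```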

Choose the variables

Initially, it is hypothesized that family and entertainment are the key factors that would affect students’ alcohol consumption. Thus, famrel, famsup, freetime and goout are chosen for the logistic regression analysis.

gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

From the bar plots of all variables, most of the students are aged 15 to 18, while the legal drinking age is 18 or higher in many countries. Heavy alcohol consumption among students under 18 is therefore likely to be influenced by their family or their peers.

Exploration of the variables

famrel1 <- ggplot(alc, aes(x = high_use, y = famrel, col = sex)) + geom_boxplot()
famsup1 <- ggplot(alc, aes(x = high_use, y = famsup, col = sex)) + geom_boxplot()
freetime1 <- ggplot(alc, aes(x = high_use, y = freetime, col = sex)) + geom_boxplot()
goout1 <- ggplot(alc, aes(x = high_use, y = goout, col = sex)) + geom_boxplot()

ggarrange(famrel1, famsup1, freetime1, goout1,
          ncol = 2, nrow = 2)

Looking at family relationships versus high_use, most high-use students still report good relations with their family members. Judging from the boxplots, boys have more free time than girls, which may partly explain why they drink more. Students who go out with friends more often per week also tend to drink more. Because family educational support is a yes/no variable, little can be read from its boxplot.

The numerical summaries below confirm that most students do not drink heavily. Looking at the largest high-use counts across the four variables: 52 high-use students report a family-relationship score of 4. The number of high-use students with family educational support is 66, higher than the 45 without family support. For free time, the high-use counts at levels 3 and 4 are the same, 40 each. For going out, the highest high-use count is 38 at level 4, while the highest non-high-use count is 97 at level 3. These numerical and graphical summaries alone do not establish any relationship between the variables and alcohol consumption, so logistic regression is run to provide more information.

alc %>% group_by(famrel, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'famrel'. You can override using the `.groups` argument.
## # A tibble: 10 x 3
## # Groups:   famrel [5]
##    famrel high_use count
##     <int> <lgl>    <int>
##  1      1 FALSE        6
##  2      1 TRUE         2
##  3      2 FALSE        9
##  4      2 TRUE         9
##  5      3 FALSE       39
##  6      3 TRUE        25
##  7      4 FALSE      128
##  8      4 TRUE        52
##  9      5 FALSE       77
## 10      5 TRUE        23
alc %>% group_by(famsup, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'famsup'. You can override using the `.groups` argument.
## # A tibble: 4 x 3
## # Groups:   famsup [2]
##   famsup high_use count
##   <fct>  <lgl>    <int>
## 1 no     FALSE       94
## 2 no     TRUE        45
## 3 yes    FALSE      165
## 4 yes    TRUE        66
alc %>% group_by(freetime, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'freetime'. You can override using the `.groups` argument.
## # A tibble: 10 x 3
## # Groups:   freetime [5]
##    freetime high_use count
##       <int> <lgl>    <int>
##  1        1 FALSE       15
##  2        1 TRUE         2
##  3        2 FALSE       46
##  4        2 TRUE        14
##  5        3 FALSE      112
##  6        3 TRUE        40
##  7        4 FALSE       65
##  8        4 TRUE        40
##  9        5 FALSE       21
## 10        5 TRUE        15
alc %>% group_by(goout, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'goout'. You can override using the `.groups` argument.
## # A tibble: 10 x 3
## # Groups:   goout [5]
##    goout high_use count
##    <int> <lgl>    <int>
##  1     1 FALSE       19
##  2     1 TRUE         3
##  3     2 FALSE       82
##  4     2 TRUE        15
##  5     3 FALSE       97
##  6     3 TRUE        23
##  7     4 FALSE       40
##  8     4 TRUE        38
##  9     5 FALSE       21
## 10     5 TRUE        32

Logistic regression model

m <- glm(high_use ~ famrel + famsup + freetime + goout, data = alc, family = "binomial")

summary(m)
## 
## Call:
## glm(formula = high_use ~ famrel + famsup + freetime + goout, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7364  -0.7799  -0.5596   0.9979   2.4376  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.1950     0.6982  -3.144  0.00167 ** 
## famrel       -0.4191     0.1371  -3.056  0.00224 ** 
## famsupyes    -0.1900     0.2542  -0.747  0.45482    
## freetime      0.2014     0.1364   1.477  0.13966    
## goout         0.7405     0.1233   6.006  1.9e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 391.89  on 365  degrees of freedom
## AIC: 401.89
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)      famrel   famsupyes    freetime       goout 
##  -2.1949968  -0.4191495  -0.1899943   0.2014051   0.7405345

From the model summary, family relationships and going out significantly influence high alcohol consumption among students; going out is the most significant factor (p = 1.9e-09). Family educational support and free time are not significant. The coefficients returned by coef() match the estimates in the summary. Family relationships and family educational support have negative coefficients (-0.4191 and -0.1900), meaning they are negatively associated with the target variable (high alcohol consumption), whereas free time and going out are positively associated (0.2014 and 0.7405, respectively).

Odds ratios and confidence intervals

OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.1113589 0.0272651 0.4247412
## famrel      0.6576059 0.5006796 0.8588025
## famsupyes   0.8269639 0.5027027 1.3646341
## freetime    1.2231201 0.9373983 1.6021304
## goout       2.0970561 1.6575495 2.6907078

The odds ratios are obtained by exponentiating the coefficients. Values greater than 1 indicate a positive association, and values below 1 a negative one. For the family factors, each one-point improvement in family relationships multiplies the odds of high alcohol consumption by 0.658, i.e. reduces them by about 34.2%; students with family educational support have about 17.3% lower odds (OR 0.827) than those without. For the entertainment factors, each additional point of free time multiplies the odds of high consumption by 1.223, and each additional point of going out multiplies them by 2.097. The intervals shown are 95% confidence intervals (from the 2.5% to the 97.5% quantile); since the intervals for famsup and freetime include 1, their associations are not statistically significant.
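The percentage interpretations above come straight from the exponentiated coefficients. A sketch on the built-in mtcars data (am as a binary outcome), not the alcohol data:

```r
# Sketch: odds ratios from a logistic regression are exp(coefficients);
# (OR - 1) * 100 gives the percent change in odds per unit of the predictor.
g  <- glm(am ~ wt, data = mtcars, family = "binomial")
or <- exp(coef(g))

or              # odds ratios
(or - 1) * 100  # percent change in the odds per unit increase in the predictor
```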

Next, to examine the coefficients of each variable more closely, the intercept is removed from the model below. Without an intercept, both famsup levels appear significant, but this is an artifact of the reparameterization: famsupno and famsupyes now absorb the intercept, and their difference equals the original famsupyes coefficient. Free time is still not a significant factor in the model, so it should be tested further to see whether it has any statistical significance.

m1 <- glm(high_use ~ famrel + famsup + freetime + goout - 1, data = alc, family = "binomial")

summary(m1)
## 
## Call:
## glm(formula = high_use ~ famrel + famsup + freetime + goout - 
##     1, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7364  -0.7799  -0.5596   0.9979   2.4376  
## 
## Coefficients:
##           Estimate Std. Error z value Pr(>|z|)    
## famrel     -0.4191     0.1371  -3.056 0.002241 ** 
## famsupno   -2.1950     0.6982  -3.144 0.001668 ** 
## famsupyes  -2.3850     0.6818  -3.498 0.000469 ***
## freetime    0.2014     0.1364   1.477 0.139657    
## goout       0.7405     0.1233   6.006  1.9e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 512.93  on 370  degrees of freedom
## Residual deviance: 391.89  on 365  degrees of freedom
## AIC: 401.89
## 
## Number of Fisher Scoring iterations: 4
coef(m1)
##     famrel   famsupno  famsupyes   freetime      goout 
## -0.4191495 -2.1949968 -2.3849911  0.2014051  0.7405345
m2  <- glm(high_use ~ famrel + famsup + goout, data = alc, family = "binomial") 
m3  <- glm(high_use ~ famrel + freetime +goout, data = alc, family = "binomial")

anova(m, m2, test = 'LRT')
## Analysis of Deviance Table
## 
## Model 1: high_use ~ famrel + famsup + freetime + goout
## Model 2: high_use ~ famrel + famsup + goout
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       365     391.89                     
## 2       366     394.09 -1  -2.1989   0.1381
anova(m, m3, test = 'LRT')
## Analysis of Deviance Table
## 
## Model 1: high_use ~ famrel + famsup + freetime + goout
## Model 2: high_use ~ famrel + freetime + goout
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       365     391.89                     
## 2       366     392.45 -1 -0.55693   0.4555

According to the likelihood-ratio test comparing the models with and without free time, the p-value is 0.1381, so free time does not significantly improve the model. Similarly, comparing the models with and without family support gives a p-value of 0.4555. Free time and family support are therefore not significant factors influencing students’ high alcohol consumption, and both can be removed from the logistic regression model to simplify it and improve prediction.

Prediction validation and loss functions

probabilities <- predict(m, type = "response")

alc0 <- mutate(alc, probability = probabilities, prediction = probabilities > 0.5)

g <- ggplot(alc0, aes(x = probability, y = high_use, col = prediction))

g + geom_point()

At first, the initial four variables (family relationships, family educational support, free time and going out) are used in the logistic regression, and the whole dataset is used for prediction validation. From the plot, there appear to be more false negatives (actual high_use predicted FALSE) than true positives, so the exact prediction results should be examined with the table function.

select(alc0, age, freetime, failures, goout, high_use, probability, prediction) %>% tail(10)
##     age freetime failures goout high_use probability prediction
## 361  17        4        0     3    FALSE  0.30061496      FALSE
## 362  19        3        1     2    FALSE  0.14352564      FALSE
## 363  18        4        1     3     TRUE  0.22036842      FALSE
## 364  18        3        0     3    FALSE  0.18771510      FALSE
## 365  18        4        0     3    FALSE  0.26223864      FALSE
## 366  19        4        1     2    FALSE  0.11877781      FALSE
## 367  18        3        0     4    FALSE  0.37866087      FALSE
## 368  18        1        0     1    FALSE  0.15813018      FALSE
## 369  17        4        0     5     TRUE  0.81381569       TRUE
## 370  18        4        0     1     TRUE  0.08903782      FALSE
table(high_use = alc0$high_use, prediction = alc0$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   237   22
##    TRUE     72   39
table(high_use = alc0$high_use, prediction = alc0$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64054054 0.05945946 0.70000000
##    TRUE  0.19459459 0.10540541 0.30000000
##    Sum   0.83513514 0.16486486 1.00000000

As shown by the two tables above, the prediction accuracy is (true negatives + true positives) divided by the total number of students. Conveniently, the prop.table function gives these proportions directly: the accuracy of the model is 0.6405 + 0.1054 = 0.7459, i.e. the model classifies 74.59% of the students correctly.
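The same accuracy can be computed directly as the share of correct predictions; a sketch on made-up actual/predicted vectors:

```r
# Sketch: accuracy as the mean of correct classifications,
# using made-up actual/predicted vectors.
actual    <- c(TRUE, FALSE, TRUE, FALSE)
predicted <- c(TRUE, FALSE, FALSE, FALSE)

accuracy <- mean(predicted == actual)  # 3 of 4 correct
accuracy                               # 0.75
```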

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc0$high_use, prob = alc0$probability)
## [1] 0.2540541

The loss function provides an easy way to validate predictions: it is the mean proportion of incorrect classifications. As shown above, the loss is 0.2541, so the accuracy is 1 - 0.2541 = 0.7459, consistent with the value calculated by hand from the table results.

From the previous analysis, free time is not significant and can be removed from the logistic regression model; the likelihood-ratio tests showed that family educational support is not significant either. Therefore, two variables (family relationships and going out) are used for the model, and the whole dataset is used for prediction validation.

m4  <- glm(high_use ~ famrel + goout, data = alc, family = "binomial")

probabilities1 <- predict(m4, type = "response")

alc1 <- mutate(alc, probability = probabilities1, prediction = probabilities1 > 0.5)

loss_func(class = alc1$high_use, prob = alc1$probability)
## [1] 0.2432432

As we can see, the loss of the reduced model (m4) decreases from 0.2541 to 0.2432 compared to the model with all four variables (m). Selecting only the significant factors improves the prediction accuracy of the model.

Cross validation

loss_func(class = alc1$high_use, prob = alc1$probability)
## [1] 0.2432432
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m4, K = 10)
cv$delta[1]
## [1] 0.2540541

After 10-fold cross-validation, the model appears to perform slightly worse than when validated on the whole data: the cross-validated loss (0.2541) is higher than the training loss (0.2432). This is expected, because in cross-validation the data is divided into 10 parts and each part is predicted by a model that did not see it during training, so the cross-validated loss is a more realistic estimate of prediction error even though it is somewhat higher.

crossvalidation <- function(model){
  cv <- cv.glm(data = alc, cost = loss_func, glmfit = model, K = 10)
  cv$delta[1]
}


accuracy <- sapply(1:100, function(n)crossvalidation(m4))

accuracy
##   [1] 0.2378378 0.2621622 0.2432432 0.2567568 0.2459459 0.2594595 0.2513514
##   [8] 0.2594595 0.2540541 0.2702703 0.2567568 0.2459459 0.2486486 0.2432432
##  [15] 0.2702703 0.2729730 0.2567568 0.2594595 0.2567568 0.2540541 0.2567568
##  [22] 0.2540541 0.2540541 0.2567568 0.2675676 0.2567568 0.2459459 0.2621622
##  [29] 0.2567568 0.2540541 0.2540541 0.2594595 0.2648649 0.2486486 0.2486486
##  [36] 0.2459459 0.2648649 0.2513514 0.2540541 0.2594595 0.2648649 0.2540541
##  [43] 0.2513514 0.2432432 0.2621622 0.2567568 0.2594595 0.2540541 0.2513514
##  [50] 0.2432432 0.2567568 0.2567568 0.2594595 0.2540541 0.2513514 0.2486486
##  [57] 0.2675676 0.2648649 0.2513514 0.2540541 0.2513514 0.2594595 0.2675676
##  [64] 0.2594595 0.2567568 0.2675676 0.2459459 0.2567568 0.2486486 0.2540541
##  [71] 0.2459459 0.2540541 0.2594595 0.2405405 0.2567568 0.2567568 0.2648649
##  [78] 0.2513514 0.2594595 0.2594595 0.2540541 0.2513514 0.2675676 0.2540541
##  [85] 0.2567568 0.2459459 0.2540541 0.2621622 0.2702703 0.2621622 0.2567568
##  [92] 0.2621622 0.2567568 0.2513514 0.2621622 0.2567568 0.2594595 0.2675676
##  [99] 0.2432432 0.2513514
v <- data.frame(accuracy)
ggplot(v, aes(x = as.numeric(row.names(v)), y = accuracy, group = as.numeric(row.names(v)))) + 
  geom_point(aes(color = ifelse(accuracy>0.2432432, 'red', 'blue'))) + 
  scale_colour_manual(labels = c("<0.2432432", ">0.2432432"), values=c('red', 'blue')) +
  labs(color = "Values") +labs(x = 'cycles of cross validation', y = 'Loss of models')

Running 100 cycles of cross-validation for the model, only a few of the cross-validated losses fall below the loss obtained when validating on the whole data (0.2432). This demonstrates that the cross-validated prediction error varies within a range, due to the random division into training and testing parts.
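The variation across cycles comes from the random fold assignment; fixing the random seed makes a single cross-validation run reproducible. A sketch with boot::cv.glm on the built-in mtcars data (not the alcohol data):

```r
# Sketch: cv.glm assigns folds randomly, so fixing the seed
# reproduces the same cross-validation result. Uses built-in mtcars.
library(boot)

fit <- glm(am ~ wt, data = mtcars, family = "binomial")

set.seed(42)
cv_a <- cv.glm(mtcars, fit, K = 5)$delta[1]
set.seed(42)
cv_b <- cv.glm(mtcars, fit, K = 5)$delta[1]

identical(cv_a, cv_b)  # TRUE: same seed, same folds, same loss
```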

To examine how the number of predictors affects the model, ten variables are selected and then reduced one at a time down to a single variable in the logistic regression model. The cross-validation results are calculated for each number of variables.

# generate the models with different numbers of predictors
mm10 <- glm(high_use ~ famrel + famsup + freetime + goout + sex + age + higher + absences + failures + activities, data = alc, family = "binomial") 
mm9 <- glm(high_use ~ famrel + famsup + freetime + goout + sex + age + higher + absences + failures, data = alc, family = "binomial") 
mm8 <- glm(high_use ~ famrel + famsup + freetime + goout + sex + age + higher + absences, data = alc, family = "binomial") 
mm7 <- glm(high_use ~ famrel + famsup + freetime + goout + sex + age + higher, data = alc, family = "binomial") 
mm6 <- glm(high_use ~ famrel + famsup + freetime + goout + sex + age, data = alc, family = "binomial") 
mm5 <- glm(high_use ~ famrel + famsup + freetime + goout + sex, data = alc, family = "binomial") 
mm4 <- glm(high_use ~ famrel + famsup + freetime + goout, data = alc, family = "binomial") 
mm3 <- glm(high_use ~ famrel + famsup + freetime, data = alc, family = "binomial") 
mm2 <- glm(high_use ~ famrel + famsup, data = alc, family = "binomial") 
mm1 <- glm(high_use ~ famrel, data = alc, family = "binomial")

cv10 <- c(crossvalidation(mm10), crossvalidation(mm9), crossvalidation(mm8), crossvalidation(mm7), crossvalidation(mm6), crossvalidation(mm5), crossvalidation(mm4), crossvalidation(mm3), crossvalidation(mm2), crossvalidation(mm1))

No_predictor <- 10:1

super_bonus <- data.frame(No_predictor, cv10)

ggplot(super_bonus, aes(x = No_predictor, y = cv10)) + geom_line() + labs(x = 'Number of predictors', y = 'Loss of models') + xlim(10.5, 0.5)

According to the trend above, the fewer the predictors, the higher the loss of the model tends to be, so the prediction accuracy depends on the number of variables in the model. However, more variables do not automatically mean better prediction: insignificant predictors can decrease the accuracy of the logistic regression model. Searching for the set of significant variables would generate the most accurate prediction.
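An alternative to removing predictors by hand is automated selection; a sketch using R's step() function, which drops predictors by AIC (on mtcars, not the alcohol data):

```r
# Sketch: backward selection by AIC with step(), as an automated
# alternative to dropping insignificant predictors manually.
full <- glm(am ~ wt + hp + qsec, data = mtcars, family = "binomial")

sel <- step(full, direction = "backward", trace = 0)
formula(sel)  # the predictors step() decided to keep
```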


Clustering and classification (week 4)

date()
## [1] "Wed Dec 01 14:47:34 2021"
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## corrplot 0.91 loaded
library(tidyr)
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
library(GGally)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Load the Boston data

data('Boston')

The Boston data has 506 observations and 14 variables and comes from a survey of housing values in the suburbs of Boston. crim is the per capita crime rate by town; zn is the proportion of residential land zoned for lots over 25,000 sq.ft; indus is the proportion of non-retail business acres per town. The full description of the variables can be found in the MASS documentation ([click here](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html)).

Overview of the data

cor_matrix<-cor(Boston) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

chart.Correlation(Boston, histogram=TRUE, pch=19)

As shown in the two figures above, correlations with p-value < 0.01 are considered significant. The circles in the first figure represent the significant correlations between pairs of variables; insignificant ones are shown as blanks. Blue circles are positive correlations and red ones are negative correlations. According to the second figure, most variables are significantly correlated with each other except chas (the Charles River dummy variable), which does not correlate notably with the other variables.
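The significance masking described above can also be made explicit by computing pairwise p-values and passing them to corrplot()'s p.mat argument; a sketch using base R's cor.test() (the cor_pvals() helper is hypothetical):

```r
library(MASS)  # for the Boston data (already loaded above)

# Pairwise correlation p-values via cor.test(); the resulting matrix
# can be passed to corrplot(..., p.mat = p_mat, sig.level = 0.01,
# insig = "blank") to blank out the insignificant correlations.
cor_pvals <- function(df) {
  k <- ncol(df)
  p <- matrix(0, k, k, dimnames = list(names(df), names(df)))
  for (i in seq_len(k)) for (j in seq_len(k)) {
    if (i != j) p[i, j] <- cor.test(df[[i]], df[[j]])$p.value
  }
  p
}

p_mat <- cor_pvals(Boston)
```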

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

From the summary, the variables span very different ranges. For example, the crime rate ranges from 0.00632 to 88.97620, whereas rm (average number of rooms per dwelling) has a narrow range, from 3.561 to 8.780. tax has relatively large values compared to the other variables. Thus, it is necessary to standardize the whole dataset.

Standardization of the data

boston_scaled <- scale(Boston)
print(summary(boston_scaled))
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Standardization is an important step before clustering: \(scaled(x) = \frac{x - mean(x)}{sd(x)}\). After scaling, all variables are on a comparable scale, which prevents variables with larger scales from outweighing the others.
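scale() implements exactly this formula, centering each column by its mean and dividing by its standard deviation; a quick check on one column:

```r
library(MASS)  # for the Boston data (already loaded above)

# scale() should reproduce (x - mean(x)) / sd(x) column by column.
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
scaled <- scale(Boston)[, "crim"]
all.equal(as.numeric(scaled), manual)  # TRUE
```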

Separation of training and testing data for crime rate

# creating a categorical variable of the crime rate
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# removing the old crime rate variable and adding the new quantile of 'crime' variable
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

# dividing the data into training and testing group
n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
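Note that sample() makes the 80/20 split random, so the exact rows in train and test vary between knits; setting a seed beforehand (the value 2014 here is an arbitrary example) would make the split reproducible:

```r
# Seeding before sample() makes the train/test split reproducible.
set.seed(2014)          # arbitrary example seed
n <- 506                # number of rows in boston_scaled
ind <- sample(n, size = n * 0.8)
length(ind)             # 404 observations go to the training set
```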

Linear discriminant analysis

lda.fit <- lda(crime ~ . , data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)

plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

From the linear discriminant analysis plot, the high crime rate group is separated from the low crime rate group and from most of the medium-high crime rate group. rad (index of accessibility to radial highways) is the most discriminating variable, with a longer arrow (larger discriminant coefficients) than the other variables. The angles between the arrows represent the relationships between variables. There are 13 explanatory variables in the training data, so it is hard to distinguish the differences among them, but it can be concluded that most variables are correlated with each other.

Prediction of the LDA model

correct_classes <- test$crime
test <- dplyr::select(test, -crime)
str(test)
## 'data.frame':    102 obs. of  13 variables:
##  $ zn     : num  -0.4872 0.0487 0.0487 -0.4872 -0.4872 ...
##  $ indus  : num  -1.306 -0.476 -0.476 -0.437 -0.437 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.834 -0.265 -0.265 -0.144 -0.144 ...
##  $ rm     : num  1.227 -0.93 0.131 -0.641 -0.498 ...
##  $ age    : num  -0.511 1.116 0.914 -0.429 -1.395 ...
##  $ dis    : num  1.077 1.086 1.212 0.334 0.334 ...
##  $ rad    : num  -0.752 -0.522 -0.522 -0.637 -0.637 ...
##  $ tax    : num  -1.105 -0.577 -0.577 -0.601 -0.601 ...
##  $ ptratio: num  0.113 -1.504 -1.504 1.175 1.175 ...
##  $ black  : num  0.441 0.328 0.393 0.427 0.331 ...
##  $ lstat  : num  -1.025 2.419 1.092 -0.586 -0.85 ...
##  $ medv   : num  1.486 -0.6559 -0.819 -0.2863 0.0617 ...
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class) %>% prop.table() %>% addmargins()
##           predicted
## correct            low     med_low    med_high        high         Sum
##   low      0.117647059 0.058823529 0.019607843 0.000000000 0.196078431
##   med_low  0.039215686 0.254901961 0.049019608 0.000000000 0.343137255
##   med_high 0.000000000 0.078431373 0.127450980 0.009803922 0.215686275
##   high     0.000000000 0.000000000 0.000000000 0.245098039 0.245098039
##   Sum      0.156862745 0.392156863 0.196078431 0.254901961 1.000000000

According to the prediction results, the accuracy of the model is 0.745 (0.118 + 0.255 + 0.127 + 0.245), the sum of the diagonal of the table above.
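The accuracy is simply the sum of the diagonal of the confusion proportion table; a small helper makes this explicit (the accuracy() name is hypothetical):

```r
# Accuracy = share of correct predictions = sum of the diagonal of
# the confusion proportion table.
accuracy <- function(correct, predicted) {
  sum(diag(prop.table(table(correct, predicted))))
}

# toy check: 3 of 4 labels agree
accuracy(c("a", "a", "b", "b"), c("a", "a", "b", "a"))  # 0.75
```

Applied to correct_classes and lda.pred$class above, this gives the model accuracy directly.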

Distance measures and k-means clustering

data('Boston')

re_scaled <- data.frame(scale(Boston))

dist_eu <- dist(re_scaled)

summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Euclidean distance is the most common way to measure distances between observations, so it is applied here. After measuring the distances between the scaled observations, the minimum distance is 0.1343 while the maximum is 14.3970. The median and mean distances are 4.8241 and 4.9111, respectively.
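dist() uses the euclidean metric by default, \(d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}\); a two-point sanity check:

```r
# dist() defaults to method = "euclidean"; a 3-4-5 triangle check.
x <- c(0, 3)
y <- c(4, 0)
as.numeric(dist(rbind(x, y)))  # 5
```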

# determining the numbers of K-means
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(re_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

In the plot of the total within-cluster sum of squares against the number of clusters, the value drops radically when k is 3. Thus, k = 3 is selected for k-means clustering.

# choosing k = 3
km <-kmeans(re_scaled, centers = 3)
cluster <- as.factor(km$cluster)
ggpairs(re_scaled, aes(colour = cluster, alpha = 0.4), 
columns = colnames(re_scaled),
upper = "blank",
diag = NULL)

There are three colors in the plot because k = 3. For most variable pairs, the data are clustered into 3 separate groups. However, the 3 centers are not clearly separated in some panels, such as tax vs age and ptratio vs age.

Bonus

set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

If the unscaled Boston data is used directly to find the optimal number of clusters, the plot shows that the trend becomes relatively stable after k is 3.

# choose k number and create a dataset with cluster
km <-kmeans(Boston, centers = 3)

newboston <- Boston

newscaled <- scale(newboston)
newscaled <- data.frame(newscaled)

newscaled$cluster <- km$cluster

# create the training and testing data
n <- nrow(newscaled)

ind <- sample(n,  size = n * 0.8)

train1 <- newscaled[ind,]

test1 <- newscaled[-ind,]

correct_classes <- test1$cluster

test1 <- dplyr::select(test1, -cluster)

# lda analysis
lda.fit <- lda(cluster ~ . , data = train1)
newclass <- as.numeric(train1$cluster)
plot(lda.fit, dimen = 2, col = newclass, pch = newclass)
lda.arrows(lda.fit, myscale = 1)

In the figure above, the three clusters are separated clearly. tax and rad have the longest arrows, which means they are the most influential linear separators for the clusters.

Super-Bonus

model_predictors <- dplyr::select(train, -crime)
lda.fit <- lda(crime ~ . , data = train)
km <-kmeans(model_predictors, centers = 3)

dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)

In the two plots there are two clear clusters separated from each other. When crime is used as the indicator, the high crime rate group is clustered together with a few medium-high crime rate cases, while most of the medium-high cases cluster with the other groups. This suggests that the factors behind a high crime rate differ from those behind the other crime rate levels. Because k = 3 was chosen based on the earlier optimization of the number of clusters, the k-means model groups the observations into 3 clusters according to their correlations. However, some points that appear closer to clusters 1 and 2 are statistically assigned to cluster 3.


Dimensionality reduction techniques (week 5)

date()
## [1] "Wed Dec 01 14:48:14 2021"
library(dplyr)
library(ggplot2)
library(GGally)
library(corrplot)
library(tidyr)
library(ggpubr)
library(RColorBrewer)
library(FactoMineR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

The graphical overview of the data

setwd('//ad.helsinki.fi/home/z/zhangdon/Desktop/Introduction to open data science/IODS-project/data')
human <- read.table('human.csv', sep=',', header=TRUE, row.names = 1)

ggpairs(human)

cor(human) %>% corrplot()

There are 195 observations and 8 variables in the human data. Following the instructions from last week, the column names of the data were shortened:

+ GNI is gross national income per capita;
+ le_birth is life expectancy at birth;
+ ey_edu is expected years of schooling;
+ mm_ratio is maternal mortality ratio;
+ ado_br is adolescent birth rate;
+ pp_parlia is percentage of female representatives in parliament;
+ edu2f is proportion of females with at least secondary education;
+ edu2m is proportion of males with at least secondary education;
+ labf is proportion of females in the labour force;
+ labm is proportion of males in the labour force;
+ edu2_ratio is edu2f / edu2m;
+ lab_ratio is labf / labm.

(edu2f, edu2m, labf and labm are intermediate variables from the wrangling; only their ratios remain in the final data.)

From the ggpairs plot, most of the variables are normally distributed and significantly correlated with each other. The corrplot figure shows that the maternal mortality ratio and the adolescent birth rate correlate negatively with most of the variables. Red represents negative correlation and blue positive correlation.

PCA on the not scaled data

pca_no_scaled <- prcomp(human)
s <- summary(pca_no_scaled)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
pca_pr_no_scaled <- round(100*s$importance[2, ], digits = 1)
pc_lab_no_scaled <- paste0(names(pca_pr_no_scaled), " (", pca_pr_no_scaled, "%)")
biplot(pca_no_scaled, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_no_scaled[1], ylab = pc_lab_no_scaled[2])

PCA on the scaled data

human_std <- scale(human)
summary(human_std)
##    edu2_ratio        lab_ratio          le_birth           ey_edu       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI             mm_ratio           ado_br          pp_parlia      
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
pca_scaled <- prcomp(human_std)
s1 <- summary(pca_scaled)
s1
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
pca_pr_scaled <- round(100*s1$importance[2, ], digits = 1)
pc_lab_scaled <- paste0(names(pca_pr_scaled), " (", pca_pr_scaled, "%)")
biplot(pca_scaled, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_scaled[1], ylab = pc_lab_scaled[2])

Comparing the plots with and without standardization, the scaled data gives a reasonable picture of the correlations among the variables. In the unscaled figure GNI dominates the PCA: principal component 1 accounts for nearly 100% of the variance. In the scaled figure, PC1 explains 53.6% and PC2 16.2%, and all variables contribute sensibly. pp_parlia and lab_ratio contribute mostly to PC2 and are positively correlated with each other; the female ratio is the common theme of these two variables. The others are related to PC1. edu2_ratio, ey_edu, GNI and le_birth are correlated with each other, whereas mm_ratio and ado_br are close to each other because maternal mortality and adolescent births are related phenomena.
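The percentages on the axis labels come directly from the standard deviations returned by prcomp(); a sketch on a built-in dataset (iris as a stand-in for the human data):

```r
# Proportion of variance explained by each PC = sdev^2 / sum(sdev^2).
pca <- prcomp(scale(iris[, 1:4]))
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * prop_var, 1)  # percentages, summing to 100
```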

Principal component dimensions

s # not scaled
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
s1 # scaled
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000

According to the importance of components, without scaling the first principal component has by far the highest standard deviation and proportion of variance. In the scaled biplot, 6 of the variables are correlated with PC1 because the angles between their arrows are relatively small. The arrows of GNI and edu2_ratio are shorter than the others, which means their contributions to the first two components are smaller.

MCA on the tea dataset

data(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep_columns))
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
dim(tea_time)
## [1] 300   6
gather(tea_time) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

There are 300 observations and 6 variables in the tea_time data. All variables are factors with 2-4 levels. From the bar plots, tea bags are the most popular way to drink tea. Most people drink their tea without any additions and not at lunch. There is little difference between sugar and no sugar; the most popular tea type is Earl Grey and the most common place to buy tea is a chain store.

mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

After multiple correspondence analysis, the eigenvalues, individuals and categories of each dimension can be summarized.

fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 45))

var <- get_mca_var(mca)

Dimension 1 accounts for 15.2% of the variance, dimension 2 for 14.2% and dimension 3 for 12%. The explained variance does not drop sharply from one dimension to the next; the differences among them are slight. Thus, it is necessary to explore the details of the different dimensions.

fig1 <- fviz_contrib(mca, choice = "var", axes = 1, top = 15)
fig2 <- fviz_contrib(mca, choice = "var", axes = 2, top = 15)
fig3 <- fviz_contrib(mca, choice = "var", axes = 3, top = 15)
fig4 <- fviz_contrib(mca, choice = "var", axes = 4, top = 15)

ggarrange(fig1, fig2, fig3, fig4,
          ncol = 2, nrow = 2)

In dimension 1, tea shop and unpackaged tea are the main contributors. In dimension 2, chain store+tea shop and tea bag+unpackaged are the most important factors. In dimensions 3 and 4, the type of tea is the significant contributor: black tea and green tea have the highest contributions in these two dimensions, respectively.

fviz_contrib(mca, choice = "var", axes = 1:4, top = 15)

Among the first 4 dimensions, tea shop, unpackaged, chain store+tea shop and tea bag+unpackaged have the highest contributions.

fviz_mca_var(mca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE,
             ggtheme = theme_minimal())

The closer the cos2 value is to 1, the stronger the correlation between the variable and the dimension. In the figure above, chain store+tea shop and tea bag+unpackaged, tea bag and chain store, and unpackaged and tea shop are the variables best represented by the dimensions.

plot(mca, invisible=c("ind"), habillage = "quali", cex = 1.3, palette=palette(brewer.pal(8, 'Dark2')))

Clearly, the two variable groups how (the packaging of the tea) and where (the place of purchase) are very similar to each other.


(more chapters to be added similarly as we proceed with the course!)